Abstract

Open domain relation extraction systems identify relation and argument phrases in a sentence without relying on any underlying schema. However, current state-of-the-art relation extraction systems are available only for English because of their heavy reliance on linguistic tools such as part-of-speech taggers and dependency parsers. We present a cross-lingual annotation projection method for language independent relation extraction. We evaluate our method on a manually annotated test set and present results on three typologically different languages. We release these manual annotations and extracted relations in 61 languages from Wikipedia.


1 Introduction 


Relation extraction (RE) is the task of assigning a semantic relationship between a pair of arguments. The two major types of RE are closed domain and open domain RE. While closed-domain RE systems (Bunescu and Mooney, 2005; Bunescu, 2007; Mintz et al., 2009; Yao and Van Durme, 2014; Berant and Liang, 2014) consider only a closed set of relationships between two arguments, open domain systems (Yates et al., 2007; Carlson et al., 2010) use an arbitrary phrase to specify a relationship. In this paper, we focus on open-domain RE for multiple languages. Although there are advantages to closed domain RE (Banko and Etzioni, 2008), it is expensive to construct a closed set of relation types which would be meaningful across multiple languages.

Open RE systems extract patterns from sentences in a given language to identify relations. For learning these patterns, the sentences are analyzed using a part of speech tagger, a dependency parser and possibly a named-entity recognizer. In languages other than English, these tools are either unavailable or not accurate enough to be used. In comparison, it is easier to obtain parallel bilingual corpora which can be used to build machine translation systems (Resnik and Smith, 2003; Smith et al., 2013).


In this paper, we present a system that performs RE on a sentence in a source language by first translating the sentence to English, performing RE in English, and finally projecting the relation phrase back to the source language sentence. Our system assumes the availability of a machine translation system from a source language to English and an open RE system in English, but no other analysis tool in the source language. The main contributions of this work are:


• A pipeline to develop a relation extraction system for any source language.

• Extracted open relations in 61 languages based on the Wikipedia corpus.

• Manual judgements for the projected relations in three languages.

We first describe our methodology for language independent cross-lingual projection of extracted relations (§2), followed by the relation annotation procedure and the results (§3). The manually annotated relations in 3 languages and the automatically extracted relations in 61 languages are available at: http://cs.cmu.edu/~mfaruqui/soft.html










































[Figure 1 (diagram): the Spanish sentence "María no abofeteó a la bruja verde" is translated to "Maria did not slap the green witch" (Translation); arg1, relation and arg2 are identified in the English sentence (Relation Extraction) and then mapped back onto the Spanish sentence (Projection).]

Figure 1: RE in a Spanish sentence using the cross-lingual relation extraction pipeline.


2 Multilingual Relation Extraction 


Our method of RE for a sentence $s = (s_1, s_2, \ldots, s_N)$ in a non-English language consists of three steps: (1) Translation of $s$ into English, which generates a sentence $t = (t_1, t_2, \ldots, t_M)$ with word alignments $a$ relative to $s$, (2) Open RE on $t$, and (3) Relation projection from $t$ to $s$. Figure 1 shows an example of RE in Spanish using our proposed pipeline.¹ We employ OLLIE² (Mausam et al., 2012) for RE in English and the Google Translate³ API for translation from the source language to English, although in principle we could use any translation system to translate the language to English. We next describe each of these components.
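The three steps above can be sketched as a small Python function. This is only an illustration under stated assumptions: `translate`, `open_re` and `project` are hypothetical callables standing in for a translation system, an English open RE system and the projection step, and the toy stand-ins below (including the hand-written alignment) are invented for the Figure 1 sentence; they are not the APIs of Google Translate or OLLIE.

```python
from typing import Callable, List, NamedTuple, Tuple

Alignment = List[Tuple[int, int]]  # (source index, target index), 0-based


class Triple(NamedTuple):
    arg1: str
    rel: str
    arg2: str


def extract_relations(
    source: List[str],
    translate: Callable[[List[str]], Tuple[List[str], Alignment]],
    open_re: Callable[[List[str]], List[Triple]],
    project: Callable[[List[str], List[str], Alignment, str], str],
) -> List[Triple]:
    """Step 1: translate; step 2: English open RE; step 3: project back."""
    target, alignment = translate(source)
    return [
        Triple(*(project(source, target, alignment, phrase) for phrase in triple))
        for triple in open_re(target)
    ]


# --- Toy stand-ins (hypothetical, for illustration only) ---
def toy_translate(source: List[str]) -> Tuple[List[str], Alignment]:
    target = "Maria did not slap the green witch".split()
    # Hand-written alignment for the Figure 1 sentence (an assumption).
    alignment = [(0, 0), (1, 1), (1, 2), (2, 3), (3, 4), (4, 4), (5, 6), (6, 5)]
    return target, alignment


def toy_open_re(target: List[str]) -> List[Triple]:
    return [Triple("Maria", "did not slap", "the green witch")]


def toy_project(source, target, alignment, phrase: str) -> str:
    # Map every English word of the phrase to its aligned source words.
    words = phrase.split()
    target_idx = {j for j, tok in enumerate(target) if tok in words}
    source_idx = sorted({i for i, j in alignment if j in target_idx})
    return " ".join(source[i] for i in source_idx)


source = "María no abofeteó a la bruja verde".split()
result = extract_relations(source, toy_translate, toy_open_re, toy_project)
print(result)
```

Passing the components as callables mirrors the remark that any translation system could be substituted; only the alignment format would need to match.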


2.1 Relation Extraction in English 


Suppose $t = (t_1, t_2, \ldots, t_M)$ is a tokenized English sentence. Open relation extraction computes triples of non-overlapping phrases (arg1; rel; arg2) from the sentence $t$. The two arguments arg1 and arg2 are connected by the relation phrase rel.

We utilized OLLIE (Mausam et al., 2012) to extract the relation tuples for every English sentence. We chose OLLIE because it has been shown to give a higher yield at comparable precision relative to other open RE systems such as REVERB and WOE$^{parse}$ (Mausam et al., 2012). OLLIE was trained by extracting dependency path patterns on annotated training data. This training data was bootstrapped from a set of high precision seed tuples extracted from a simpler RE system, REVERB (Fader et al., 2011). In Godse killed Gandhi, the extracted relation (Godse; killed; Gandhi) can be expressed by the dependency pattern: arg1 ↑ nsubj ↑ rel:postag=VBD ↓ dobj ↓ arg2.⁴ OLLIE also normalizes the relation phrase for some of the phrases, for example is president of is normalized to be president of.⁵

¹ This is a sample sentence and is not taken from Wikipedia.
² http://knowitall.github.io/ollie/
³ https://developers.google.com/translate/

Data: s, t, a, p_t
Result: p_s
P ← PhraseExtract(s, t, a)
p_s = ∅, score = −∞, overlap = 0
for (phr_s, phr_t) ∈ P do
    if BLEU(phr_t, p_t) > score then
        if phr_t ∩ p_t ≠ ∅ then
            p_t ← phr_t
            score ← BLEU(phr_t, p_t)
            overlap ← phr_t ∩ p_t
if overlap ≠ 0 then
    length = ∞
    for (phr_s, p_t) ∈ P do
        if len(phr_s) < length then
            length ← len(phr_s)
            p_s ← phr_s
else
    p_s ← WordAlignmentProj(s, t, a, p_t)

Algorithm 1: Cross-lingual projection of phrase p_t from a target sentence t to a source sentence s using word alignments a and parallel phrases P.


2.2 Cross-lingual Relation Projection 

We next describe an algorithm to project the extracted relation tuples in English back to the source language sentence. Given a source sentence, the Google Translate API provides us its translation along with the word-to-word alignments relative to the source. If $s = s_1^N$ and $t = t_1^M$ denote the source and its English translation, then the alignment is $a = \{a_{ij} : 1 \le i \le N,\ 1 \le j \le M\}$, where $a_{ij} = 1$ if $s_i$ is aligned to $t_j$, and is 0 otherwise.

⁴ Example borrowed from Mausam et al. (2012).
⁵ For sentences where the veracity of a relation depends on a clause, OLLIE also outputs the clause. For example, in Early astronomers believed that Earth is the center of the universe, the relation (Earth; be center of; universe) is supplemented by an (AttributedTo: believe; Early astronomers) clause. We ignore this clausal information.

A naive word-alignment based projection would map every word from a phrase extracted in English to the source sentence. This algorithm has two drawbacks: first, since the word alignments are many-to-many, each English word can possibly be mapped to more than one source word, which leads to ambiguity in its projection; second, a word level mapping can produce non-contiguous phrases in the source sentence, which are hard to interpret semantically.
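The second drawback of the naive projection can be made concrete with a short Python sketch. The alignment below is hypothetical and uses 0-based indices (the text uses 1-based $i$ and $j$); it stores only the pairs $(i, j)$ with $a_{ij} = 1$.

```python
from typing import List, Set, Tuple


def naive_projection(alignment: Set[Tuple[int, int]],
                     target_span: Tuple[int, int]) -> List[int]:
    """Source indices aligned to any target word in [start, end)."""
    start, end = target_span
    return sorted({i for i, j in alignment if start <= j < end})


def is_contiguous(indices: List[int]) -> bool:
    """True if the indices form one unbroken run of positions."""
    return bool(indices) and indices[-1] - indices[0] + 1 == len(indices)


# Hypothetical alignment: pairs (source index, target index).
# Target word 3 is aligned to two source words (many-to-many).
alignment = {(0, 0), (1, 1), (3, 2), (2, 3), (4, 3)}

projected = naive_projection(alignment, (1, 3))  # target words 1 and 2
print(projected, is_contiguous(projected))
```

Here the English phrase covering target words 1 and 2 maps to source words 1 and 3, skipping source word 2, so the projected phrase is not a contiguous span of the source sentence.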

To tackle these problems, we introduce a novel algorithm that incorporates a BLEU score (Papineni et al., 2002) based phrase similarity metric to perform cross-lingual projection of relations. Given a source sentence, its translation, and the word-to-word alignment, we first extract phrase-pairs $P$ using the phrase-extract algorithm (Och and Ney, 2004). In each extracted phrase pair $(phr_s, phr_t) \in P$, $phr_s$ and $phr_t$ are contiguous word sequences in $s$ and $t$ respectively. We next determine the translations of arg1, rel and arg2 from the extracted phrase-pairs.

For each English phrase $p \in \{$arg1, rel, arg2$\}$, we first obtain the phrase-pair $(phr_s, phr_t) \in P$ such that $phr_t$ has the highest BLEU score relative to $p$, subject to the condition that $p \cap phr_t \neq \emptyset$, i.e., there is at least one word overlap between the two phrases. This condition is necessary since we use BLEU score with smoothing and may obtain a non-zero BLEU score even with zero word overlap. If there are multiple phrase-pairs in $P$ that correspond to the same target phrase $phr_t$, we select the shortest source phrase ($phr_s$). However, if there is no word overlap between the target phrase $p$ and any of the target phrases in $P$, we project the phrase using the word-alignment based projection. The cross-lingual projection method is presented in Algorithm 1.


3 Experiments

Evaluation for open relations is a difficult task with no standard evaluation datasets. We first describe the construction of our multilingual relation extraction dataset and then present the experiments.

Annotation. The current approach to evaluation for open relations (Fader et al., 2011; Mausam et al., 2012) is to extract relations from a sentence and manually annotate each relation as either valid (1) or invalid (0) for the sentence. For example, in the sentence: "Michelle Obama, wife of Barack Obama was born in Chicago", the following are possible annotations: a) (Michelle Obama; born in; Chicago): 1, b) (Barack Obama; born in; Chicago): 0. Such binary annotations are not available for languages apart from English. Furthermore, a binary 1/0 label is a coarse annotation that could unfairly penalize an extracted relation which has the correct semantics but is slightly ungrammatical. This could occur either when prepositions are dropped from the relation phrase or when there is an ambiguity in the boundary of the relation phrase.

Therefore, to evaluate our multilingual relation extraction framework, we obtained annotations from professional linguists for three typologically different languages: French, Hindi, and Russian. The annotation task is as follows: Given a sentence and a pair of arguments (extracted automatically from the sentence), the annotator identifies the most relevant contiguous relation phrase from the sentence that establishes a plausible connection between the two arguments. If there is no meaningful contiguous relation phrase between the two arguments, the arguments are considered invalid and hence, the extracted relation tuple from the sentence is considered incorrect.

Given the human annotated relation phrase and the automatically extracted relation phrase, we can measure the similarity between the two, thus alleviating the problem of coarse annotation in binary judgments. For evaluation, we first report the percentage of valid arguments. Then for sentences with valid arguments, we use smoothed sentence-level BLEU score (max n-gram order = 3) to measure the similarity of the automatically extracted relation relative to the human annotated relation.⁶

⁶ We obtained two annotations for ~300 Russian sentences. Between the two annotations, the perfect agreement rate was 74.5% and the average BLEU score was 0.85.
⁷ www.wikipedia.org

Language   % valid   BLEU   Relation length (Gold)   Relation length (Auto)
French     81.6%     0.47   3.6                      2.5
Hindi      64.9%     0.38   4.1                      2.8
Russian    63.5%     0.62   1.8                      1.7

Table 1: % of valid relations and BLEU score of the extracted relations across languages with the average relation phrase length (in words).

Language   Size     Language     Size
French     6,743    Georgian     497
Hindi      367      Latvian      491
Russian    7,532    Tagalog      102
Chinese    2,876    Swahili      114
Arabic     707      Indonesian   1,876

Table 2: Number of extracted relations (in thousands) from Wikipedia in 10 languages out of a total of 61.

Language   Argument 1       Relation phrase          Argument 2
French     Il               fut enrôlé de force au   RAD
           He               was conscripted to       RAD
Hindi      bahut se log     aaye                     cailifornia
           Many people      came to                  California
Russian    Автокатастрофа   произошла                Черногории
           Crash            occurred                 Montenegro

Table 3: Examples of extracted relations in different languages with English translations (Hindi is transliterated).

Figure 2: Number of automatically extracted relations binned by their BLEU scores computed relative to the manually annotated relations.

Results. We extracted relations from the entire Wikipedia⁷ corpus in Russian, French and Hindi from all sentences whose lengths are in the range of 10 to 30 words. We randomly selected 1,000 relations for each of these languages and annotated them. The results are shown in Table 1. The percentage of valid extractions is highest in French (81.6%) followed by Hindi and Russian (64.0%). Surprisingly, Russian obtains the lowest percentage of valid relations but has the highest BLEU score between the automatic and the human extracted relations. This could be attributed to the fact that the average relation length (in number of words) is the shortest for Russian. From Table 1 we observe that the length of the relation phrase is inversely correlated with the BLEU score.
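The inverse relationship between relation length and BLEU can be checked directly against the Table 1 figures. The sketch below computes a Pearson correlation over the three languages; with only three data points this is an illustration of the trend, not a statistical test, and the helper `pearson` is written here for the example.

```python
import math

# Values copied from Table 1.
bleu = {"French": 0.47, "Hindi": 0.38, "Russian": 0.62}
gold_length = {"French": 3.6, "Hindi": 4.1, "Russian": 1.8}
auto_length = {"French": 2.5, "Hindi": 2.8, "Russian": 1.7}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


languages = ["French", "Hindi", "Russian"]
scores = [bleu[lang] for lang in languages]
r_gold = pearson([gold_length[lang] for lang in languages], scores)
r_auto = pearson([auto_length[lang] for lang in languages], scores)
print(f"gold length vs BLEU: {r_gold:.2f}")
print(f"auto length vs BLEU: {r_auto:.2f}")
```

Both coefficients come out strongly negative, consistent with the observation that languages with longer relation phrases obtain lower BLEU scores.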

Figure 2 shows the distribution of the number of extracted relations across bins of similar BLEU scores. Interestingly, the highest BLEU score bin (1) contains the maximum number of relations in all three languages. This is an encouraging result since it implies that the majority of the extracted relation phrases are identical to the manually annotated relations. Table 2 lists the number of automatically extracted relations for 10 of the 61 languages from Wikipedia that we are going to make publicly available. These were selected to include a mixture of high-resource, low-resource, and typologically different languages. Table 3 shows examples of randomly selected relations in different languages along with their English translations.
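A histogram like the one in Figure 2 can be produced by grouping per-relation BLEU scores into bins. The sketch below assumes bins obtained by rounding each score to one decimal place, so that the bin labelled 1.0 holds exact or near-exact matches; the bin width actually used for Figure 2 is not stated, and the scores in the example are made up.

```python
from collections import Counter
from typing import Dict, Iterable


def bin_scores(scores: Iterable[float]) -> Dict[float, int]:
    """Count scores per bin, where a bin is the score rounded to 0.1."""
    return dict(Counter(round(score, 1) for score in scores))


# Hypothetical BLEU scores for five extracted relations.
example_scores = [1.0, 1.0, 0.52, 0.48, 0.0]
histogram = bin_scores(example_scores)
largest_bin = max(histogram, key=histogram.get)
print(histogram, largest_bin)
```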


4 Related Work 


Cross-lingual projection has been used for transfer of syntactic (Yarowsky and Ngai, 2001; Hwa et al., 2005) and semantic information (Riloff et al., 2002; Padó and Lapata, 2009). There has been a growing interest in RE for languages other than English. Gamallo et al. (2012) present a dependency-parser based open RE system for Spanish, Portuguese and Galician. RE systems for Korean have been developed for both open-domain (Kim et al., 2011) and closed-domain (Kim and Lee, 2012; Kim et al., 2014) using annotation projection. These approaches use a Korean-English parallel corpus to project relations extracted in English to Korean. Following projection, a Korean POS-tagger and a dependency parser are employed to learn an RE system for Korean.